Matrix Multiplication on the Intel Touchstone Delta

Authors

  • Steven Huss-Lederman
  • Elaine M. Jacobson
  • Anna Tsao
  • Guodong Zhang
Abstract

Matrix multiplication is a key primitive in block matrix algorithms such as those found in LAPACK. We present results from our study of matrix multiplication algorithms on the Intel Touchstone Delta, a distributed memory message-passing architecture with a two-dimensional mesh topology. We obtain an implementation that uses communications primitives highly suited to the Delta and exploits the single-node assembly-coded matrix multiplication. Our algorithm is completely general, able to deal with arbitrary mesh aspect ratios and matrix dimensions, and has achieved parallel efficiency of 86% with overall peak performance in excess of 8 Gflops on 256 nodes for an 8800 × 8800 matrix. We describe our algorithm design and implementation, and present performance results that demonstrate scalability and robust behavior over varying mesh topologies.
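To put the figures quoted in the abstract in perspective, the following is a minimal arithmetic sketch, not taken from the paper. It assumes the conventional operation count of 2n³ for a dense n × n multiply and the common definition of parallel efficiency as aggregate rate divided by (nodes × single-node rate); the paper may define efficiency differently. The function names are hypothetical, and 8 Gflops is used as a round figure even though the abstract says "in excess of".

```python
def matmul_flops(n):
    """Operation count for a dense n x n times n x n multiply (assumed 2*n^3)."""
    return 2 * n ** 3


def implied_single_node_rate(aggregate_rate, nodes, efficiency):
    """Single-node rate implied by efficiency = aggregate / (nodes * single).

    This definition of efficiency is an assumption, not quoted from the paper.
    """
    return aggregate_rate / (nodes * efficiency)


# Figures quoted in the abstract.
n = 8800
nodes = 256
efficiency = 0.86
aggregate_rate = 8e9  # flop/s; the abstract reports "in excess of" this value

per_node_rate = aggregate_rate / nodes
single_node_rate = implied_single_node_rate(aggregate_rate, nodes, efficiency)
runtime_bound = matmul_flops(n) / aggregate_rate  # seconds, at exactly 8 Gflops

print(f"operations:            {matmul_flops(n):.3e}")
print(f"per-node rate:         {per_node_rate / 1e6:.2f} Mflops")
print(f"implied 1-node rate:   {single_node_rate / 1e6:.2f} Mflops")
print(f"runtime at 8 Gflops:   {runtime_bound:.1f} s")
```

Because the reported rate is a lower bound, the per-node figures here are lower bounds as well, and the runtime is an upper bound under the same assumptions.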


Similar resources

Pumma: Parallel universal matrix multiplication algorithms on distributed memory concurrent computers

This paper describes the Parallel Universal Matrix Multiplication Algorithms (PUMMA) on distributed memory concurrent computers. The PUMMA package includes not only the non-transposed matrix multiplication routine C = A · B, but also transposed multiplication routines C = A^T · B, C = A · B^T, and C = A^T · B^T, for a block scattered data distribution. The routines perform efficiently for a wide ...


Parallel Matrix Transpose Algorithms on Distributed Memory Concurrent Computers

This paper describes parallel matrix transpose algorithms on distributed memory concurrent processors. We assume that the matrix is distributed over a P × Q processor template with a block scattered data distribution. P, Q, and the block size can be arbitrary, so the algorithms have wide applicability. The communication schemes of the algorithms are determined by the greatest common divisor (GCD...


The Spectral Decomposition of Nonsymmetric Matrices on Distributed Memory Parallel Computers

The implementation and performance of a class of divide-and-conquer algorithms for computing the spectral decomposition of nonsymmetric matrices on distributed memory parallel computers are studied in this paper. After presenting a general framework, we focus on a spectral divide-and-conquer (SDC) algorithm with Newton iteration. Although the algorithm requires several times as many floating poin...


Parallel Tridiagonalization through Two-Step Band Reduction

We present a two-step variant of the "successive band reduction" paradigm for the tridiagonalization of symmetric matrices. Here we reduce a full matrix first to narrow-banded form and then to tridiagonal form. The first step allows easy exploitation of block orthogonal transformations. In the second step, we employ a new blocked version of a banded matrix tridiagonalization algorithm by Lang. In ...


Extended Two-Phase Method for Accessing Sections of Out-of-Core Arrays

A number of applications on parallel computers deal with very large data sets that cannot fit in main memory. In such applications, data must be stored in files on disks and fetched into memory during program execution. Parallel programs with large out-of-core arrays stored in files must read/write smaller sections of the arrays from/to files. In this paper, we describe a method for accessing sections ...



Journal:
  • Concurrency - Practice and Experience

Volume 6

Publication date: 1993